Supplementary Material for Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Anonymous Author(s)
The masking strategy is set to random, and the mask ratio m is 60%.

Embedding: To embed each masked point patch, Point-MAE substitutes it with a learnable, weight-shared mask token. For unmasked (i.e., visible) point patches, Point-MAE employs a lightweight PointNet [8] to extract features. The visible point patches $P_v$ are hence embedded into visible tokens $T_v$:

$$T_v = \mathrm{PointNet}(P_v) \tag{1}$$

Backbone: The backbone of Point-MAE is entirely based on standard Transformers, with an asymmetric encoder-decoder design. The encoder takes the visible tokens $T_v$ as input and generates encoded tokens $T_e$. In addition, Point-MAE incorporates positional embeddings into each Transformer block, thereby adding location information.
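The random masking and visible-patch embedding steps described above can be illustrated with a minimal sketch. It is not the Point-MAE implementation: the function names `random_mask`, `tiny_pointnet`, and `embed_visible` are hypothetical, the "PointNet" is reduced to a single shared linear map with ReLU followed by max-pooling over the points of a patch, and the patch count (10), patch size (32 points), and token width (8) are toy values chosen only for illustration. Only the 60% mask ratio comes from the text.

```python
import random


def random_mask(num_patches, mask_ratio=0.6, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def tiny_pointnet(patch, weights):
    """Toy PointNet-style embedding (an assumption, not the paper's network):
    a shared per-point linear map with ReLU, then max-pooling over points."""
    dim = len(weights[0])
    token = [0.0] * dim  # ReLU outputs are >= 0, so 0 is a safe start for max
    for point in patch:
        for j in range(dim):
            act = sum(point[i] * weights[i][j] for i in range(3))
            token[j] = max(token[j], max(act, 0.0))
    return token


def embed_visible(patches, visible, weights):
    """T_v = PointNet(P_v): embed only the visible patches into tokens."""
    return [tiny_pointnet(patches[i], weights) for i in visible]


# Toy data: 10 patches of 32 points each, and a 3 x 8 weight matrix.
rng = random.Random(0)
patches = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(32)]
           for _ in range(10)]
weights = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]

visible, masked = random_mask(len(patches), mask_ratio=0.6, seed=0)
tokens = embed_visible(patches, visible, weights)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

Max-pooling makes each token independent of the order of the points inside a patch, which is the property that makes a PointNet-style encoder suitable for unordered point sets. In the method described above, the masked patches would be replaced by the shared learnable mask token rather than embedded.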
[Figure: qualitative comparison; panel labels: Mip-NeRF 360, Ours, GT, w/o diffusion, w/o background, Ours, GT]
PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation
The BlendedMVS [7] dataset is a large-scale synthetic dataset for multi-view stereo containing 113 scenes, which can be further divided by scene scale into a large-scale outdoor scenes part and a small-scale objects part. Since current large-scene NeRF methods train one model per scene, to save computational resources and time we select the first five scenes of the large-scale outdoor scenes part and compare with Mip-NeRF 360 [2], which is the best-performing baseline on the representative subset of the OMMO dataset [3], as shown in our manuscript. See Tab. 4 and Figure 1.